Navigating the Data Lake: Unsupervised Structure Extraction for Text-formatted Data

Authors

  • Yihan Gao
  • Silu Huang
  • Aditya Parameswaran
Abstract

Many organizations routinely accumulate automatically generated, semi-structured log file datasets; these datasets remain unused and waste storage space, a phenomenon that has been termed the “data lake” problem. One way to put these datasets to use is to convert them into a structured relational format, after which they can be analyzed in conjunction with other datasets. To this end, we present CATAMARAN, an automatic structure extraction tool that requires no human supervision. CATAMARAN automatically identifies field and record endpoints, separates the structured parts from unstructured noise or formatting, and can tease apart multiple structures within a single dataset, so that structured relational datasets can be extracted from semi-structured log files efficiently, at scale, and with high accuracy. Compared to unsupervised adaptations of supervised structure extraction tools developed in prior work, CATAMARAN makes fewer assumptions and achieves much higher accuracy. In particular, CATAMARAN successfully extracts structured information from all datasets used in prior work on supervised structure extraction, and achieves 95% extraction accuracy on automatically collected log files from GitHub.

Similar resources

Navigating the Data Lake with Datamaran: Automatically Extracting Structure from Log Datasets

Organizations routinely accumulate semi-structured log datasets generated as the output of code; these datasets remain unused and uninterpreted, and waste storage space, a phenomenon colloquially referred to as the “data lake” problem. One approach to leverage these semi-structured datasets is to convert them into a structured relational format, following which they can be analyzed in co...

Data Extraction using Content-Based Handles

In this paper, we present an approach and a visual tool, called HWrap (Handle Based Wrapper), for creating web wrappers that extract data records from web pages. Our approach relies mainly on the visible page content to identify data regions on a web page. Our extraction algorithm is inspired by the way a human user scans the page content for specific data. In particular, we use text fea...

Accurate Unsupervised Learning of Field Structure Models for Information Extraction

The applicability of current information extraction techniques is severely limited by the need for supervised training data. We demonstrate that for certain field structured extraction tasks, small amounts of prior knowledge can be used to effectively learn models in a primarily unsupervised fashion. Many text information sources exhibit a latent field structure: such documents can be viewed as...

Dimensionality Reduction of Hyperspectral Data to Increase Class Separability and Preserve Data Structure

Hyperspectral imaging, which gathers hundreds of spectral bands from the surface of the Earth, allows us to separate materials with similar spectra. Hyperspectral images can be used in many applications, such as estimation of chemical and physical land parameters, classification, target detection, and unmixing. Among these applications, classification is of particular interest. A hyperspectral im...

Unsupervised Learning of Field Segmentation Models for Information Extraction

The applicability of many current information extraction techniques is severely limited by the need for supervised training data. We demonstrate that for certain field structured extraction tasks, such as classified advertisements and bibliographic citations, small amounts of prior knowledge can be used to learn effective models in a primarily unsupervised fashion. Although hidden Markov models...

Publication date: 2016